Sep 14, 2022
Reminder: Links to course materials and main sites (Piazza, Canvas, Github) can be found on the home page of the main course website:
https://musa-550-fall-2022.github.io/
Week #2 repository: https://github.com/MUSA-550-Fall-2022/week-2
Recommended readings for the week listed here
Last time
Today
# The imports
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
So many tools...so little time
You'll use different packages to achieve different goals, and they each have different things they are good at.
Today, we'll focus on:
And next week for geospatial data:
Goal: introduce you to the most common tools and enable you to know the best package for the job in the future
We'll use the object-oriented interface to matplotlib
Create Figure and Axes objects
Add plots to the Axes object
Customize any and all aspects of the Figure or Axes objects
Pro: Matplotlib is extraordinarily general — you can do pretty much anything with it
Con: There's a steep learning curve, with a lot of matplotlib-specific terms to learn

We'll use the Palmer penguins data set: measurements collected for three species of penguins at Palmer Station in Antarctica

Artwork by @allison_horst
# Load data on Palmer penguins
penguins = pd.read_csv("./data/penguins.csv")
penguins.head(n=10)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | female | 2007 |
| 7 | Adelie | Torgersen | 39.2 | 19.6 | 195.0 | 4675.0 | male | 2007 |
| 8 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN | 2007 |
| 9 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN | 2007 |
Data is already in tidy format
I want to scatter flipper length vs. bill length, colored by the penguin species
# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Color for each species
color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}
# Group the data frame by species and loop over each group
# NOTE: "group" will be the dataframe holding the data for "species"
for species, group in penguins.groupby("species"):
    print(f"Plotting {species}...")
    # Plot flipper length vs bill length for this group
    ax.scatter(
        group["flipper_length_mm"],
        group["bill_length_mm"],
        marker="o",
        label=species,
        color=color_map[species],
        alpha=0.75,
    )
# Format the axes
ax.legend(loc="best")
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Bill Length (mm)")
ax.grid(True)
# Show
plt.show()
Plotting Adelie...
Plotting Chinstrap...
Plotting Gentoo...
pandas?
# Tab complete on the plot attribute of a dataframe to see the available functions
#penguins.plot.scatter?
# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# Calculate a list of colors
color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}
colors = [color_map[species] for species in penguins["species"]]
# Scatter plot two columns, colored by third
penguins.plot.scatter(
x="flipper_length_mm",
y="bill_length_mm",
c=colors,
alpha=0.75,
ax=ax, # Plot on the axes object we created already!
)
# Format
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Bill Length (mm)")
ax.grid(True)
Note: no easy way to get legend added to the plot in this case...
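One possible workaround, sketched here rather than taken from the lecture, is to build the legend entries by hand with matplotlib's Line2D and pass them to ax.legend(). The small `toy` data frame below is a made-up stand-in for the penguins data so that the snippet runs on its own.

```python
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.lines import Line2D

# Toy stand-in for the penguins data (values are for illustration only)
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
        "flipper_length_mm": [181.0, 217.0, 196.0, 186.0],
        "bill_length_mm": [39.1, 47.5, 48.8, 39.5],
    }
)

color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}
colors = [color_map[species] for species in toy["species"]]

fig, ax = plt.subplots(figsize=(10, 6))
toy.plot.scatter(
    x="flipper_length_mm", y="bill_length_mm", c=colors, alpha=0.75, ax=ax
)

# One "proxy" marker per species: nothing is drawn on the axes,
# the handles only exist to populate the legend
handles = [
    Line2D([], [], marker="o", linestyle="none", color=color, label=species)
    for species, color in color_map.items()
]
legend = ax.legend(handles=handles, loc="best")
labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```

This works, but it is more bookkeeping than the groupby loop above, which is part of why pandas plotting is best kept for quick looks at the data.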
pandas plotting capabilities are good for quick and unpolished plots during the data exploration phase
import seaborn as sns
# Initialize the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))
# style keywords as dict
color_map = {"Adelie": "#1f77b4", "Gentoo": "#ff7f0e", "Chinstrap": "#D62728"}
style = dict(palette=color_map, s=60, edgecolor="none", alpha=0.75)
# use the scatterplot() function
sns.scatterplot(
x="flipper_length_mm", # the x column
y="bill_length_mm", # the y column
hue="species", # the third dimension (color)
data=penguins, # pass in the data
ax=ax, # plot on the axes object we made
**style # add our style keywords
)
# Format with matplotlib commands
ax.set_xlabel("Flipper Length (mm)")
ax.set_ylabel("Bill Length (mm)")
ax.grid(True)
ax.legend(loc='best')
<matplotlib.legend.Legend at 0x14398e0a0>
The ** syntax is the unpacking operator. It will unpack the dictionary and pass each keyword to the function.
So the previous code is the same as:
sns.scatterplot(
x="flipper_length_mm",
y="bill_length_mm",
hue="species",
data=penguins,
ax=ax,
palette=color_map, # defined in the style dict
s=60, # defined in the style dict
edgecolor="none", # defined in the style dict
alpha=0.75 # defined in the style dict
)
But we can use **style as a shortcut!
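To see the unpacking operator on its own, without any plotting, here is a minimal sketch. The function `describe_marker` is a hypothetical stand-in for a plotting function that accepts style keywords; it is not part of seaborn or matplotlib.

```python
def describe_marker(x, y, s=20, alpha=1.0, edgecolor="black"):
    """Hypothetical stand-in for a plotting function that takes style keywords."""
    return f"x={x}, y={y}, s={s}, alpha={alpha}, edgecolor={edgecolor}"


# Style keywords collected in a dictionary
style = dict(s=60, edgecolor="none", alpha=0.75)

# Unpack the dictionary into keyword arguments...
unpacked = describe_marker(1, 2, **style)

# ...which is equivalent to writing each keyword out by hand
explicit = describe_marker(1, 2, s=60, edgecolor="none", alpha=0.75)

print(unpacked == explicit)
```

Keeping the style in one dictionary also means several plotting calls can share the same keywords.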
In general, seaborn is fantastic for visualizing relationships between variables in a more quantitative way
Don't memorize every function...
I always look at the beautiful Example Gallery for ideas.
How about adding linear regression lines?
Use lmplot()
sns.lmplot(
x="flipper_length_mm",
y="bill_length_mm",
hue="species",
data=penguins,
height=6,
aspect=1.5,
palette=color_map,
scatter_kws=dict(edgecolor="none", alpha=0.5),
);
Use jointplot()
sns.jointplot(
x="flipper_length_mm",
y="bill_length_mm",
data=penguins,
height=8,
kind="kde",
cmap="viridis",
);
Use pairplot()
# The variables to plot
variables = [
"species",
"bill_length_mm",
"flipper_length_mm",
"body_mass_g",
"bill_depth_mm",
]
# Set the seaborn style
sns.set_context("notebook", font_scale=1.5)
# make the pair plot
sns.pairplot(
penguins[variables].dropna(),
palette=color_map,
hue="species",
plot_kws=dict(alpha=0.5, edgecolor="none"),
)
<seaborn.axisgrid.PairGrid at 0x143bf3eb0>
sns.catplot(x="species", y="bill_length_mm", hue="sex", data=penguins);
Great tutorial available in the seaborn documentation
The color_palette function in seaborn is very useful. Easiest way to get a list of hex strings for a specific color map.
viridis = sns.color_palette("viridis", n_colors=7).as_hex()
print(viridis)
['#472d7b', '#3b528b', '#2c728e', '#21918c', '#28ae80', '#5ec962', '#addc30']
sns.palplot(viridis)
You can also create custom light, dark, or diverging color maps, based on the desired hues at either end of the color map.
sns.palplot(sns.diverging_palette(10, 220, sep=50, n=7))
import altair as alt
Important: focuses on tidy data — you'll often find yourself running pd.melt() to get to tidy format
Let's try out our flipper length vs bill length example from last lecture...
# initialize the chart with the data
chart = alt.Chart(penguins)
# define what kind of marks to use
chart = chart.mark_circle(size=60)
# encode the visual channels
chart = chart.encode(
x="flipper_length_mm",
y="bill_length_mm",
color="species",
tooltip=["species", "flipper_length_mm", "bill_length_mm", "island", "sex"],
)
# make the chart interactive
chart.interactive()
Example: previous code is the same as
chart = chart.encode(
x=alt.X("flipper_length_mm"),
y=alt.Y("bill_length_mm"),
color=alt.Color("species"),
tooltip=alt.Tooltip(["species", "flipper_length_mm", "bill_length_mm", "island", "sex"]),
)
Use an alt.Scale() object to specify the scale
# initialize the chart with the data
chart = alt.Chart(penguins)
# define what kind of marks to use
chart = chart.mark_circle(size=60)
# encode the visual channels
chart = chart.encode(
x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species",
tooltip=["species", "flipper_length_mm", "bill_length_mm", "island", "sex"],
)
# make the chart interactive
chart = chart.interactive()
chart
For a complete list of these encodings, see the Encodings section of the documentation.
Altair charts can be fully specified as JSON $\rightarrow$ easy to embed in HTML on websites!
# Save the chart as a JSON string!
json = chart.to_json()
# Print out the first 1,000 characters
print(json[:1000])
{
"$schema": "https://vega.github.io/schema/vega-lite/v4.17.0.json",
"config": {
"view": {
"continuousHeight": 300,
"continuousWidth": 400
}
},
"data": {
"name": "data-d00e1631cca48c544438d30d2b470e8a"
},
"datasets": {
"data-d00e1631cca48c544438d30d2b470e8a": [
{
"bill_depth_mm": 18.7,
"bill_length_mm": 39.1,
"body_mass_g": 3750.0,
"flipper_length_mm": 181.0,
"island": "Torgersen",
"sex": "male",
"species": "Adelie",
"year": 2007
},
{
"bill_depth_mm": 17.4,
"bill_length_mm": 39.5,
"body_mass_g": 3800.0,
"flipper_length_mm": 186.0,
"island": "Torgersen",
"sex": "female",
"species": "Adelie",
"year": 2007
},
{
"bill_depth_mm": 18.0,
"bill_length_mm": 40.3,
"body_mass_g": 3250.0,
"flipper_length_mm": 195.0,
"island": "Torgersen",
"sex": "female",
chart.save("chart.html")
# Display IFrame in IPython
from IPython.display import IFrame
IFrame('chart.html', width=600, height=375)
chart = (
alt.Chart(penguins)
.mark_circle(size=60)
.encode(
x=alt.X("flipper_length_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species:N",
)
.interactive()
)
chart
Note that the interactive() call allows users to pan and zoom.
Altair is able to automatically determine the type of the variable using built-in heuristics. Altair and Vega-Lite support four primitive data types:
| Data Type | Code | Description |
|---|---|---|
| quantitative | Q | Numerical quantity (real-valued) |
| nominal | N | Name / Unordered categorical |
| ordinal | O | Ordered categorical |
| temporal | T | Date/time |
You can set the data type of a column explicitly using a one letter code attached to the column name with a colon:
Easily create multiple views of a dataset.
(
alt.Chart(penguins)
.mark_point()
.encode(
x=alt.X("flipper_length_mm:Q", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm:Q", scale=alt.Scale(zero=False)),
color="species:N"
).properties(
width=200, height=200
).facet(column="species").interactive()
)
Note: I've added the variable type identifiers (Q, N) to the previous example
Lots of features to create compound charts: repeated charts, faceted charts, vertical and horizontal stacking of subplots.
See the documentation for examples
A relatively new addition to altair, vega, and vega-lite. This allows you to define what happens when users interact with your visualization.
# create the selection box
brush = alt.selection_interval()
alt.Chart(penguins).mark_point().encode(
x=alt.X(
"flipper_length_mm", scale=alt.Scale(zero=False)
), # x
y=alt.Y(
"bill_length_mm", scale=alt.Scale(zero=False)
), # y
color=alt.condition(
brush, "species", alt.value("lightgray")
), # color
tooltip=["species", "flipper_length_mm", "bill_length_mm"],
).properties(
width=200, height=200, selection=brush
).facet(column="species")
We used the alt.condition() function to specify a conditional color for the markers. It takes three arguments:
The brush object determines whether a data point is selected.
If a point is inside the brush, color the marker according to the "species" column.
If a point is outside the brush, use the literal color "lightgray".
Let's examine the relationship between flipper_length_mm, bill_length_mm, and body_mass_g
We'll use a repeated chart that repeats variables across rows and columns.
Use a conditional color again, based on a brush selection.
# Setup the selection brush
brush = alt.selection(type='interval', resolve='global')
# Setup the chart
alt.Chart(penguins).mark_circle().encode(
x=alt.X(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
y=alt.Y(alt.repeat("row"), type='quantitative', scale=alt.Scale(zero=False)),
color=alt.condition(brush, 'species:N', alt.value('lightgray')), # conditional color
).properties(
width=200,
height=200,
selection=brush
).repeat( # repeat variables across rows and columns
row=['flipper_length_mm', 'bill_length_mm', 'body_mass_g'],
column=['body_mass_g', 'bill_length_mm', 'flipper_length_mm']
)
Let's explore the relationship between flipper length, body mass, and sex.
Scatter flipper length vs body mass for each species, colored by sex
alt.Chart(penguins).mark_point().encode(
x=alt.X('flipper_length_mm', scale=alt.Scale(zero=False)),
y=alt.Y('body_mass_g', scale=alt.Scale(zero=False)),
color=alt.Color("sex:N", scale=alt.Scale(scheme="Set2")),
).properties(
width=400, height=150
).facet(row='species')
I've specified the scale keyword to the alt.Color() object and passed a scheme value:
scale=alt.Scale(scheme="Set2")
Set2 is a ColorBrewer color scheme. The available color schemes are very similar to those in matplotlib. A list is available in the Vega documentation: https://vega.github.io/vega/docs/schemes/.
Next, plot the total number of penguins per species by the island they are found on.
(
alt.Chart(penguins)
.mark_bar()
.encode(
x=alt.X('*:Q', aggregate='count', stack='normalize'),
y='island:N',
color='species:N',
tooltip=['island','species', 'count(*):Q']
)
)
Plot a histogram of number of penguins by flipper length, grouped by species.
(
alt.Chart(penguins)
.mark_bar()
.encode(
x=alt.X('flipper_length_mm', bin=alt.Bin(maxbins=20)),
y='count():Q', #shorthand
color='species',
tooltip=['species', alt.Tooltip('count()', title='Number of Penguins')]
).properties(height=250)
)
Finally, let's bin the data by body mass and plot the average flipper length per bin, colored by the species.
(
alt.Chart(penguins.dropna())
.mark_line()
.encode(
x=alt.X("body_mass_g:Q", bin=alt.Bin(maxbins=10)),
y=alt.Y('mean(flipper_length_mm):Q', scale=alt.Scale(zero=False)), # apply a mean to the flipper length in each bin
color='species:N',
tooltip=['mean(flipper_length_mm):Q', "count():Q"]
).properties(height=300, width=500)
)
In addition to mean() and count(), you can apply a number of different transformations to the data before plotting, including binning, arbitrary functions, and filters.
See the Data Transformations section of the user guide for more details.
# Setup a brush selection
brush = alt.selection(type='interval')
# The top scatterplot: flipper length vs bill length
points = (
alt.Chart()
.mark_point()
.encode(
x=alt.X('flipper_length_mm:Q', scale=alt.Scale(zero=False)),
y=alt.Y('bill_length_mm:Q', scale=alt.Scale(zero=False)),
color=alt.condition(brush, 'species:N', alt.value('lightgray'))
).properties(
selection=brush,
width=800
)
)
# the bottom bar plot
bars = (
alt.Chart()
.mark_bar()
.encode(
x='count(species):Q',
y='species:N',
color='species:N',
).transform_filter(
brush.ref() # the filter transform uses the selection to filter the input data to this chart
).properties(width=800)
)
chart = alt.vconcat(points, bars, data=penguins) # vertical stacking
chart
Exercise: let's reproduce this famous Wall Street Journal visualization showing measles incidence over time.
http://graphics.wsj.com/infectious-diseases-and-vaccines/
# Print out the current working directory
%pwd
'/Users/nhand/Teaching/PennMUSA/Fall2022/week-2'
# List the contents of the current working directory
%ls
README.md environment.yml lecture-2B.ipynb
chart.html joining_infographic.jpg outline.md
data/ lecture-2A.ipynb
path = './data/measles_incidence.csv' # this is a relative path
data = pd.read_csv(path, skiprows=2, na_values='-')
data.head()
| | YEAR | WEEK | ALABAMA | ALASKA | ARIZONA | ARKANSAS | CALIFORNIA | COLORADO | CONNECTICUT | DELAWARE | ... | SOUTH DAKOTA | TENNESSEE | TEXAS | UTAH | VERMONT | VIRGINIA | WASHINGTON | WEST VIRGINIA | WISCONSIN | WYOMING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1928 | 1 | 3.67 | NaN | 1.90 | 4.11 | 1.38 | 8.38 | 4.50 | 8.58 | ... | 5.69 | 22.03 | 1.18 | 0.4 | 0.28 | NaN | 14.83 | 3.36 | 1.54 | 0.91 |
| 1 | 1928 | 2 | 6.25 | NaN | 6.40 | 9.91 | 1.80 | 6.02 | 9.00 | 7.30 | ... | 6.57 | 16.96 | 0.63 | NaN | 0.56 | NaN | 17.34 | 4.19 | 0.96 | NaN |
| 2 | 1928 | 3 | 7.95 | NaN | 4.50 | 11.15 | 1.31 | 2.86 | 8.81 | 15.88 | ... | 2.04 | 24.66 | 0.62 | 0.2 | 1.12 | NaN | 15.67 | 4.19 | 4.79 | 1.36 |
| 3 | 1928 | 4 | 12.58 | NaN | 1.90 | 13.75 | 1.87 | 13.71 | 10.40 | 4.29 | ... | 2.19 | 18.86 | 0.37 | 0.2 | 6.70 | NaN | 12.77 | 4.66 | 1.64 | 3.64 |
| 4 | 1928 | 5 | 8.03 | NaN | 0.47 | 20.79 | 2.38 | 5.13 | 16.80 | 5.58 | ... | 3.94 | 20.05 | 1.57 | 0.4 | 6.70 | NaN | 18.83 | 7.37 | 2.91 | 0.91 |
5 rows × 53 columns
Note: data is weekly
Hints
Use a groupby() then sum() workflow to get annual totals from the weekly data.
Drop the WEEK column first; you don't need it in the grouping operation.
You can use melt() to get tidy data. You should have 3 columns: year, state, and total incidence.
Use the mark_rect() function to encode the values as rectangles and then color them according to the average annual measles incidence per state.
You'll want to take advantage of the custom color map defined below to best match the WSJ's graphic.
See the documentation for more information.
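The data-wrangling hints can be sketched on a tiny table with the same wide layout. This is only an illustration of the groupby, sum, and melt steps, not the full exercise solution; the `weekly` data frame is a toy stand-in, and the output column names `state` and `incidence` are arbitrary choices.

```python
import pandas as pd

# Toy weekly table in the same wide layout as the measles data
# (two states, two years, two weeks; values are for illustration only)
weekly = pd.DataFrame(
    {
        "YEAR": [1928, 1928, 1929, 1929],
        "WEEK": [1, 2, 1, 2],
        "ALABAMA": [3.67, 6.25, 1.0, 2.0],
        "ALASKA": [None, None, 0.5, 0.5],
    }
)

# Drop WEEK, then sum the weekly values within each year.
# Note: sum() skips NaN, so a year with no data becomes 0.0;
# pass min_count=1 to sum() if it should stay NaN instead.
annual = weekly.drop(columns=["WEEK"]).groupby("YEAR").sum().reset_index()

# Wide -> tidy: one row per (year, state) pair
tidy = annual.melt(id_vars="YEAR", var_name="state", value_name="incidence")
print(tidy)
```

The resulting three-column table is the shape Altair expects: one column for each of the x, y, and color encodings.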
For data sources with more than 5,000 rows, you'll need to run the code below for Altair to work. It forces Altair to save a local copy of the data.
alt.data_transformers.enable('json')
DataTransformerRegistry.enable('json')
The categorical color scale choice is probably not the best. It's better to use a perceptually uniform color scale like viridis. See below: